
Algorithms for Binary Neural Networks

FIGURE 3.23
Kernel weight distribution of the first binarized convolutional layer of BONNs. Before training, the kernels are initialized as a single-mode Gaussian distribution. From the 2nd epoch to the 200th epoch, with λ fixed to 1e4, the distribution of the kernel weights becomes increasingly compact with two modes, which confirms that the Bayesian kernel loss can regularize the kernels into a distribution well suited for binarization.

two-mode GMM style. Figure 3.25 shows the evolution of the binarized values during the training process of XNOR-Net and BONN. The two different patterns indicate that the binarized values learned in BONN are more diverse.
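The regularizing effect illustrated in Figure 3.23 can be sketched as a penalty that pulls each kernel toward a symmetric two-mode Gaussian centered at ±μ. The following NumPy snippet is a minimal sketch under that assumption; the function name, the shared mode magnitude μ estimated as the mean absolute weight, and the simple quadratic form are illustrative simplifications, not the exact Bayesian kernel loss of BONN.

```python
import numpy as np

def two_mode_kernel_penalty(w, lam=1e4):
    """Sketch of a two-mode (±mu) regularizer on kernel weights.

    lam plays the role of the lambda = 1e4 setting mentioned in the
    text; the exact BONN Bayesian kernel loss has a different form.
    """
    mu = np.abs(w).mean()                      # shared mode magnitude (assumption)
    # Penalize each weight's distance from the nearest mode, +mu or -mu.
    return lam * np.mean((np.abs(w) - mu) ** 2)
```

Weights that already sit exactly at the two modes incur zero penalty, so gradient descent on this term compacts the distribution around ±μ, qualitatively matching the evolution shown in the figure.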

Effectiveness of Bayesian Feature Loss on Real-Valued Models: We apply our Bayesian feature loss to real-valued models, including ResNet-18 and ResNet-50 [84]. We retrain these two backbones with our Bayesian feature loss for 70 epochs, setting the hyperparameter θ to 1e3. The SGD optimizer has an initial learning rate of 0.1. We use

FIGURE 3.24
Weight distributions of XNOR and BONN based on WRN-22 (2nd, 8th, and 14th convolutional layers) after 200 epochs. The difference between the XNOR and BONN weight distributions indicates that the kernels are regularized across the convolutional layers by the proposed Bayesian kernel loss.
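The feature-level counterpart described above, applied when retraining the real-valued ResNet backbones, can be sketched as a center-loss-style term that pulls each feature vector toward its class center, weighted by θ. This is an assumption about the loss form for illustration only; the function name, the class-center array, and the plain squared-distance penalty are hypothetical, with θ = 1e3 taken from the setting in the text.

```python
import numpy as np

def feature_center_penalty(features, labels, centers, theta=1e3):
    """Center-loss-style sketch of a feature regularizer.

    features: (batch, dim) feature vectors from the backbone.
    labels:   (batch,) integer class labels.
    centers:  (num_classes, dim) per-class feature centers (assumed learnable).
    theta:    weight of the penalty, set to 1e3 in the text.
    """
    diffs = features - centers[labels]         # distance to own class center
    return theta * np.mean(np.sum(diffs ** 2, axis=1))
```

Minimizing this term alongside the cross-entropy objective compacts intra-class features around their centers, which is the qualitative effect a feature-level Bayesian loss would target.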